Abstract: In an automatic pattern recognition system, the processor that
selects and measures features of the data, on the basis of
which classification is made, is called a "feature selector"
or "feature extractor". This paper presents a mathematical
programming approach for the design of a feature extractor.

DESIGN OF OPTIMAL FEATURE EXTRACTORS BY MATHEMATICAL PROGRAMMING TECHNIQUES*


by

Rui J. P. de Figueiredo
Rice University, Houston, Texas 77001

1. Introduction

The "feature extraction" operation plays a very significant role in the
functioning of a pattern recognition system. For this reason, considerable
attention has been devoted to the feature extraction problem in the pattern
recognition literature (see, for example, [1] and [2] and the references
therein). However, most of the existing techniques for feature extractor
design rely on the maximization of some average distance measure amongst
pattern classes in the "feature (transformed) space."

More recently, it was proposed by the author [3] that the performance
of the entire pattern recognition system ought to be taken into account when
selecting the optimal feature extraction transformation. According to this
point of view, the structure of the desired feature extractor would be tuned
to the classification strategy adopted in a given problem. In particular,
if the classification strategy were Bayes, the optimal feature extraction
transformation would be the one that would minimize, over a suitable class
of admissible transformations, the Bayes risk (probability of
misclassification) in the feature space. The problem thus posed becomes
essentially a mathematical programming problem.

In what follows, we formulate precisely the above problem in a framework
sufficiently general to permit the use of statistical and/or linguistic
considerations. Then, for the case in which the classification strategy is
Bayes, we discuss the important design considerations and cite some specific
results.


[Figure not reproduced; surviving labels: pattern v, pattern set $\Pi$.]

Fig. 1. Block Diagram of a Pattern Recognition System.


*Supported by the AFOSR Grant 75-2777.

2. Mathematical Formulation

The general configuration of a pattern recognition system is depicted in
Fig. 1. It consists of four blocks representing respectively a system S of
sensors, a preprocessor P, a feature extractor G, and a classifier C. In an
actual hardware implementation, some of these blocks may overlap or be
combined into a single unit.

The sensing system S converts the excitation from a "pattern" v being
perceived by S into some form of raw data w. This data is usually contaminated
by "distortion" and "noise" introduced by the sensing devices and the
environment. The preprocessor P simply removes, to the extent possible, this
distortion and noise from w by means of some "cleaning" (filtering,
enhancement, restoration, ...) operation. Thus the output x from P is what
may be called "clean data" or "preprocessed signal". The feature extractor G
then measures the values of a set of variables pertaining to x called
"features". Hopefully, these variables contain most of the information needed
for recognition purposes. Finally, if y denotes the output of G, the
classifier C assigns y to some pattern class $H^j$, and thus the recognition
operation is completed.

It may be remarked in passing that, in some pattern recognition
literature (see, for example, [1], p. 7), the preprocessor P and the feature
extractor G are considered to be one and the same entity. Here, we make
the distinction between P and G, in the sense that P performs a "signal
processing" operation on the raw data with the objective of essentially
optimizing the signal-to-noise ratio without necessarily taking the ultimate
use of the signal into account (many of the so-called "picture processing"
papers deal with this problem); on the other hand, the objective of G is to
provide measurements on the (filtered) signal solely for the purpose of
recognition.

Let us now attempt to formulate precisely the problem under consideration.

Let $\Pi$ denote the set of all patterns v to be sensed by S. Assume that
$\Pi$ may be partitioned into M pattern classes $H^1, \ldots, H^M$, and let
the set $\{H^1, \ldots, H^M\}$ be denoted by H. Each member of $\Pi$ is to be
classified as belonging to some $H^j$.

Let $\Gamma$, $\Psi$, $\Phi$ and H denote the sets of outputs from
respectively S, P, G, and C.

As done earlier in this section, members of $\Gamma$, $\Psi$, and $\Phi$ will
be denoted respectively by w, x, and y. Any given output from C is a statement
saying that a pattern v being perceived by S belongs to some class $H^j$. We
will denote such a statement simply by $H^j$. The set of all $H^j$,
$j = 1, \ldots, M$, constitutes the set H mentioned above.

In order to describe the operation of S, P, G, and C we introduce
respectively the maps

    $S: \Pi \to \Gamma$,      (1)

    $P: \Gamma \to \Psi$,     (2)

    $A: \Psi \to \Phi$,       (3)

    $C: \Phi \to H$.          (4)

The operation of the pattern recognition system may now be expressed
by

    $H^j = C(A(P(S(v))))$     (5a)

    $\phantom{H^j} \triangleq K(v)$.     (5b)

It ought to be noted that in (5a) the superscript j on $H^j$ is generic,
that is, depending on v, $H^j$ could be any one of the statements
$H^1, \ldots, H^M$.

In terms of the above notation then, we will call a given set $(\Pi, H)$ a
"pattern structure". Also, given any "pattern recognition system", we will
refer to it by the set of maps {S, P, A, C}, or simply by the corresponding
composition map K, which describes it.
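
The composition in (5a) and (5b) can be illustrated with a minimal sketch. The four toy maps below are hypothetical stand-ins chosen only so that the example runs (none of them comes from the paper), and the class statement $H^j$ is represented simply by the integer j.

```python
def compose_system(S, P, A, C):
    """Return the composition map K = C o A o P o S of (5a)-(5b)."""
    def K(v):
        return C(A(P(S(v))))
    return K

# Hypothetical toy maps, for illustration only.
def S(v):
    # Sensors: pattern v -> raw data w, with a fixed offset as "distortion".
    return [t + 0.5 for t in v]

def P(w):
    # Preprocessor: raw data w -> clean data x (removes the known offset).
    return [t - 0.5 for t in w]

def A(x):
    # Feature extractor: n = 3 measurements -> m = 1 feature (the mean).
    return [sum(x) / len(x)]

def C(y):
    # Classifier: feature vector y -> class index j in {1, 2}.
    return 1 if y[0] < 0.0 else 2

K = compose_system(S, P, A, C)
print(K([-1.0, -2.0, 0.0]), K([1.0, 2.0, 3.0]))
```

Keeping S and P fixed and swapping only A (and the classifier that depends on it) mirrors the design freedom discussed in the text.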

In our formulation, we will assume, as in most practical cases, that
S is fixed because of hardware constraints, and so is P since the design of
P is conditioned by the structure of S, as we pointed out earlier.

The same is not true with regard to the two remaining maps A and C.
Clearly, A is not fixed since our very objective is to select an
$A^* \in \mathcal{X}$ which is optimal with respect to all maps A belonging
to an admissible class $\mathcal{X}$.

Usually the classification strategy is selected beforehand. However,
since the domain of C depends on A, the structure of the classifier itself
will vary with A. To signal this fact we will replace the symbol C by $C_A$.

One final consideration is the inclusion of training sets in our
problem formulation. Such sets constitute the main source of information
on a given pattern structure, on the basis of which a recognition system
for that structure can be designed. We will denote by $\Lambda^j$ the
available training set pertaining to the pattern class $H^j$,
$j = 1, \ldots, M$. Elements of $\Lambda^j$ will be denoted by $u^j$, and when
necessary to distinguish these elements among themselves we will number them
with subscripts, thus $u^j_1, \ldots, u^j_{N_j}$, where $N_j$ is the total
number of training samples in $\Lambda^j$.

In a conventional way, we will assume that the elements in $\Lambda^j$,
$j = 1, \ldots, M$, are independent (with respect to some underlying
probability measure), and that each $\Lambda^j$ is partitioned into two
subsets: a subset $\Lambda^j_C$ used in the construction of the structure of
the classifier, and a subset $\Lambda^j_A$ used in the design of the map A.
Let

    $\Lambda = \Lambda^1 \cup \Lambda^2 \cup \cdots \cup \Lambda^M$      (6)

and define the partition of $\Lambda$ into $\Lambda_C$ and $\Lambda_A$ where

    $\Lambda_C = \Lambda^1_C \cup \Lambda^2_C \cup \cdots \cup \Lambda^M_C$,      (7)

    $\Lambda_A = \Lambda^1_A \cup \Lambda^2_A \cup \cdots \cup \Lambda^M_A$.      (8)

To indicate that the structure of the classifier is based on $\Lambda_C$ we
will replace $C_A$ by $C_{A\Lambda_C}$.

Let a function $\xi$ from $H \times H$ to the reals be defined by

    $\xi(H^j, H^k) = 1 - \delta_{jk}$,   $j, k = 1, \ldots, M$,      (9)

where $\delta_{jk}$ is the Kronecker delta.


Then the total cost (risk or probability) of misclassification when all
the training samples in $\Lambda_A$ are presented to a pattern recognition
system $\{S, P, A, C_{A\Lambda_C}\}$ is

    $Q(A) = \sum_{j=1}^{M} \sum_{i} \alpha_{jk(i)} \, \xi(H^j, H^{k(i)})$,      (10)

where the inner sum runs over the training samples $u^j_i$ in $\Lambda^j_A$
and the nonnegative constants $\alpha_{jk(i)}$ are suitable "cost" weights.
The subscript k(i) on $\alpha_{jk(i)}$ is the superscript on the output
$H^{k(i)}$ of the pattern recognition system with $u^j_i$ as input, i.e.

    $H^{k(i)} = C_{A\Lambda_C}(A(P(S(u^j_i))))$.      (11)
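
The criterion (10) amounts to a weighted count of misclassified training samples. A minimal sketch follows, assuming the recognition system is available as a callable that returns the class index k for an input sample; the function names, the toy system, and the sample values are all hypothetical.

```python
def xi(j, k):
    """The 0-1 function of (9): 1 - delta_jk."""
    return 0 if j == k else 1

def empirical_cost(system, samples_by_class, alpha):
    """Weighted misclassification cost in the spirit of (10).

    system           -- callable returning the class index k(i) for a sample
    samples_by_class -- dict {j: list of samples u_i^j used for the design}
    alpha            -- dict {(j, k): nonnegative cost weight}, j != k
    """
    total = 0.0
    for j, samples in samples_by_class.items():
        for u in samples:
            k = system(u)            # superscript k(i) of the output, as in (11)
            cost = xi(j, k)
            if cost:                 # correct decisions contribute nothing
                total += alpha[(j, k)] * cost
    return total

# Hypothetical toy system: class 1 for negative inputs, class 2 otherwise.
toy_system = lambda u: 1 if u < 0 else 2
toy_samples = {1: [-2.0, -1.0, 0.5], 2: [1.0, 2.0, -0.3]}
toy_alpha = {(1, 2): 1.0, (2, 1): 2.0}
print(empirical_cost(toy_system, toy_samples, toy_alpha))
```

With all weights equal to one, the returned value is just the number of training errors.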

The optimal feature extractor design problem now reduces to the following:

Problem. Let all the symbols be defined as above. For the purpose of
finding a system {S, P, A, C} to recognize a pattern structure $(\Pi, H)$,
suppose that you are given S, P, $\Lambda$, a classification strategy C, a set
of weights $\alpha_{jk}$, $j, k = 1, \ldots, M$, $j \ne k$, and a class
$\mathcal{X}$ of admissible maps A. Find an $A^* \in \mathcal{X}$ which
minimizes Q(A) defined by (10) over all $A \in \mathcal{X}$.

At this point, the following observations about our formulation are
in order:

(i) In selecting the feature extraction map, we are considering the
performance of the entire pattern recognition system.

(ii) Except for the mild measurability condition assumed on the training
sets needed to justify our criterion functional (10), no restrictions have
been imposed on the ranges and domains of the four maps constituting our
pattern recognition system. So our formulation may be used when the pattern
recognition system under consideration is described either by operators
in linear vector spaces as in statistical pattern recognition [1], or by
the formal language approach [4].

(iii) We have developed our general formulation to the point where,
with the addition of details pertaining to a specific application, the
feature extractor design problem becomes a very well-defined nonlinear
programming problem which can be readily solved by any one of the standard
algorithms available in the literature [5].

3. Specific Considerations and Results

3.1. The structure of $\Pi$, $\Gamma$, $\Psi$ and $\Phi$

A given pattern recognition application would determine the structures
of the sets $\Pi$, $\Gamma$, $\Psi$, and $\Phi$. In many applications,
$\Gamma$, $\Psi$, and $\Phi$ are linear spaces, the dimensions of $\Gamma$ and
$\Psi$ being high and that of $\Phi$ low. For this reason, the selection of A
is sometimes called the "dimensionality reduction" problem.

From now on, we will assume that $\Phi$ and $\Psi$ are linear spaces.


3.2. The characterization of $\mathcal{X}$

One important consideration in the design of the optimal $A^*$ is the
characterization of the class $\mathcal{X}$ of admissible transformations A.
A very general class that we propose is that of continuous functions from
$\Psi$ to $\Phi$. We denote such a class by $\mathcal{X}_C$. Provided the
domain of each $A \in \mathcal{X}_C$ is assumed compact, any member of
$\mathcal{X}_C$ can be represented to any desired degree of accuracy by a
polynomic operator [6] [7]. In particular, if $\Psi$ and $\Phi$ are
finite-dimensional with dimensions n and m respectively, a member of
$\mathcal{X}_C$ may be approximated by m multivariate polynomials $p_i$,
$i = 1, \ldots, m$, expressing the components of
$y = (y_1, \ldots, y_m) \in \Phi$ in terms of the components of
$x = (x_1, \ldots, x_n) \in \Psi$, thus

    $y_i = A_i(x) = p_i(x_1, \ldots, x_n)$,   $i = 1, \ldots, m$.      (12)

A subclass of such $\mathcal{X}_C$ is the class $\mathcal{X}_{CL}$ of linear
transformations consisting of all $m \times n$ matrices.
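
To make the two classes concrete, the sketch below builds a member of the linear class from an $m \times n$ matrix and a polynomial map of the form (12) from explicit coefficient tables. The representation of a polynomial as a dictionary from exponent tuples to coefficients is an assumption made for this illustration only.

```python
def linear_map(matrix):
    """A linear feature extractor given by an m x n matrix (list of rows)."""
    def A(x):
        return [sum(a * t for a, t in zip(row, x)) for row in matrix]
    return A

def polynomial_map(polys):
    """A polynomial feature extractor as in (12).

    polys[i] is a dict {(e_1, ..., e_n): coefficient} describing p_i as a sum
    of monomials coefficient * x_1**e_1 * ... * x_n**e_n.
    """
    def A(x):
        y = []
        for p in polys:
            value = 0.0
            for exponents, coef in p.items():
                term = coef
                for t, e in zip(x, exponents):
                    term *= t ** e
                value += term
            y.append(value)
        return y
    return A

# n = 3 inputs, m = 2 features (toy coefficients, for illustration only).
A_lin = linear_map([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 1.0]])
A_poly = polynomial_map([
    {(1, 0, 0): 1.0, (0, 1, 1): 2.0},    # p_1 = x_1 + 2 x_2 x_3
    {(2, 0, 0): 1.0, (0, 0, 0): -1.0},   # p_2 = x_1^2 - 1
])
x = [1.0, 2.0, 3.0]
print(A_lin(x), A_poly(x))
```

A linear map is the special case in which every monomial has total degree one, which is why the linear class sits inside the polynomial one.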

The various discrete transforms (e.g. Fourier, Walsh, ...) that
have been used in digital data processing are linear transformations and
may be used as intermediate vehicles in composing a subclass of
$\mathcal{X}_{CL}$.

For example, suppose we pick for this purpose the discrete Fourier
transform which we denote by D. Then D is a linear transformation from
$\Psi$ to the span of an appropriate set of discrete Fourier transform basis
elements. The dimension of this span is the same as that of $\Psi$ and hence
equal to n. It is well-known that in some pattern recognition problems, events
pertaining to different pattern classes are better separated in the
transformed domain. For the purpose of dimensionality reduction then, one
would follow the transform operation by some other appropriate linear
operation L. For example, L could select m of the spectral components of the
transformed data vector, the choice of these components being such that those
m components containing the maximum amount of information needed for
recognition are selected. In this case, we would construct a subset of
$\mathcal{X}_{CL}$ consisting of all maps A = LD, different A's corresponding
to different L's (different choices of m spectral components).
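
A small sketch of this subclass follows: each candidate A = LD keeps m components of a discrete Fourier transform, and the candidates are compared by exhaustive search on a training-error count. The nearest-class-mean rule used to score each candidate is a stand-in assumed for brevity (it is not the Bayes strategy discussed in the paper), and all function names and data are hypothetical.

```python
import cmath
from itertools import combinations

def dft(x):
    """Discrete Fourier transform D of a real sequence x (direct O(n^2) sum)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def make_LD(indices):
    """A = L D, where L keeps only the spectral components listed in indices."""
    def A(x):
        spectrum = dft(x)
        return [spectrum[k] for k in indices]
    return A

def design_errors(A, lam_C, lam_A):
    """Class means come from lam_C; misclassifications are counted on lam_A."""
    means = {}
    for j, xs in lam_C.items():
        feats = [A(x) for x in xs]
        means[j] = [sum(col) / len(col) for col in zip(*feats)]
    def dist(y, mu):
        return sum(abs(a - b) ** 2 for a, b in zip(y, mu))
    errors = 0
    for j, xs in lam_A.items():
        for x in xs:
            y = A(x)
            k = min(means, key=lambda c: dist(y, means[c]))
            errors += int(k != j)
    return errors

def best_selection(lam_C, lam_A, n, m):
    """Exhaustive search over all choices of m spectral components."""
    return min(combinations(range(n), m),
               key=lambda idx: design_errors(make_LD(idx), lam_C, lam_A))

# Toy data: class 1 signals are constant, class 2 signals alternate.
lam_C = {1: [[1, 1, 1, 1], [3, 3, 3, 3]], 2: [[2, 0, 2, 0], [4, 2, 4, 2]]}
lam_A = {1: [[2, 2, 2, 2]], 2: [[3, 1, 3, 1]]}
best = best_selection(lam_C, lam_A, n=4, m=1)
print(best, design_errors(make_LD(best), lam_C, lam_A))
```

Exhaustive search is only practical for small n and m, since the number of candidate selections grows combinatorially.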

3.3. The Bayes Risk as Criterion Functional

If the pattern structure to be recognized is modeled probabilistically,
then the criterion functional (10), with appropriate interpretation, may
be viewed as the Bayes risk (probability of misclassification).

Thus in the probabilistic case, for $j = 1, \ldots, M$, let
$f_Y(y \mid H^j, A)$ denote the probability density function for the random
vector $Y = (Y_1, \ldots, Y_m)$ conditioned on the pattern class $H^j$ and on
the transformation A. Here we consider the elements
$y = (y_1, \ldots, y_m) \in \Phi$ as realizations of the random vector Y
conditioned on one of the pattern classes. Also, let $P_j$ denote the prior
probability for $H^j$ and $\beta_{ij}$ the cost of classifying a y arising
from $H^j$ as pertaining to $H^i$. Then the Bayes risk is:

    $Q(A) = \sum_{i=1}^{M} \sum_{j=1,\, j \ne i}^{M} P_j \, \beta_{ij} \int_{\Omega_i(A)} f_Y(y \mid H^j, A)\, dy$,      (13)

where $\Omega_i(A)$ is the Bayesian decision region in $\Phi$ for $H^i$.

If

    $\beta_{ij} = 1 - \delta_{ij}$,   $i, j = 1, \ldots, M$,      (14)

(13) reduces to the probability of misclassification

    $Q(A) = \sum_{i=1}^{M} \sum_{j=1,\, j \ne i}^{M} P_j \int_{\Omega_i(A)} f_Y(y \mid H^j, A)\, dy$.      (15)

Under (14), the decision regions $\Omega_i(A)$ appearing in (15) are
defined by

    $\Omega_i(A) = \{ y \in \Phi : g_{ij}(y, A) > 0,\ j \ne i,\ j = 1, \ldots, M \}$,   $i = 1, \ldots, M$,      (16)

where

    $g_{ij}(y, A) = P_i \, f_Y(y \mid H^i, A) - P_j \, f_Y(y \mid H^j, A)$.      (17)


We assume that the functions $f_Y$ satisfy conditions that allow
$\Omega_i(A)$ to be well defined (see [3]).
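
The decision regions (16) can be checked pointwise through the discriminants (17). The sketch below does this for a one-dimensional feature space with Gaussian class-conditional densities; the Gaussian model, the three classes, and all parameter values are assumptions made only for illustration.

```python
import math

def gaussian_pdf(y, mean, std):
    """One-dimensional Gaussian density, used here as a stand-in for f_Y."""
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def g(i, j, y, priors, densities):
    """Discriminant (17): P_i f(y | H^i) - P_j f(y | H^j)."""
    return priors[i] * densities[i](y) - priors[j] * densities[j](y)

def bayes_region(y, priors, densities):
    """Return i such that y lies in Omega_i of (16), or None on a boundary."""
    for i in priors:
        if all(g(i, j, y, priors, densities) > 0 for j in priors if j != i):
            return i
    return None

# Three hypothetical classes with equal priors and unit variance.
priors = {1: 1.0 / 3.0, 2: 1.0 / 3.0, 3: 1.0 / 3.0}
densities = {
    1: lambda y: gaussian_pdf(y, -2.0, 1.0),
    2: lambda y: gaussian_pdf(y, 0.0, 1.0),
    3: lambda y: gaussian_pdf(y, 2.0, 1.0),
}
for y in (-1.5, 0.1, 3.0):
    print(y, bayes_region(y, priors, densities))
```

Because (16) uses strict inequalities, points where two discriminants tie belong to no region; the function reports these as None.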
A number of ways of estimating the probability of error from the
training samples have been reviewed by Toussaint [8].

However, a new way of estimating the error in the feature space
has been proposed and studied by Sagar and the author [9]. This
approach essentially considers the discriminant functions $g_{ij}$,
$i = 1, \ldots, M-1$, $j = 2, \ldots, M$, $j > i$, as random variables.

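For given $f_Y(y \mid H^j, A)$, $j = 1, \ldots, M$, we define the conditional
distributions $F_j$, $j = 1, \ldots, M$, as follows: For any given (M-1)
real numbers $\gamma_1, \ldots, \gamma_{M-1}$ we have*

    $F_1(\gamma_1, \ldots, \gamma_{M-1} \mid A)
        = \mathrm{Prob}\{ y : g_{12}(y,A) < \gamma_1, \ldots, g_{1M}(y,A) < \gamma_{M-1} \mid y \in H^1 \}$,      (18-1)

    $F_2(\gamma_1, \ldots, \gamma_{M-1} \mid A)
        = \mathrm{Prob}\{ y : -g_{12}(y,A) < \gamma_1,\ g_{23}(y,A) < \gamma_2, \ldots, g_{2M}(y,A) < \gamma_{M-1} \mid y \in H^2 \}$,      (18-2)

    ...

    $F_M(\gamma_1, \ldots, \gamma_{M-1} \mid A)
        = \mathrm{Prob}\{ y : -g_{1M}(y,A) < \gamma_1,\ -g_{2M}(y,A) < \gamma_2, \ldots, -g_{M-1,M}(y,A) < \gamma_{M-1} \mid y \in H^M \}$.      (18-M)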


It now follows that the expression (15) for the probability of
misclassification is equivalent to**

    $Q(A) = \sum_{j=1}^{M} P_j \, F_j(0, \ldots, 0 \mid A)$.      (19)
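
As a consistency check, the two-class case of (19) can be written out. The following is a worked sketch for M = 2 only, assuming the set on which $g_{12}(y, A) = 0$ has probability zero; it is an illustration and is not taken from the paper.

```latex
\begin{align*}
Q(A) &= P_1\, F_1(0 \mid A) + P_2\, F_2(0 \mid A) \\
     &= P_1\, \mathrm{Prob}\{\, g_{12}(y,A) < 0 \mid y \in H^1 \,\}
      + P_2\, \mathrm{Prob}\{\, -g_{12}(y,A) < 0 \mid y \in H^2 \,\} \\
     &= P_1 \int_{\Omega_2(A)} f_Y(y \mid H^1, A)\, dy
      + P_2 \int_{\Omega_1(A)} f_Y(y \mid H^2, A)\, dy .
\end{align*}
```

The last step uses $\Omega_1(A) = \{ y : g_{12}(y,A) > 0 \}$ and $\Omega_2(A) = \{ y : g_{21}(y,A) > 0 \} = \{ y : g_{12}(y,A) < 0 \}$, which follow from (16) and (17). The result is (15) with M = 2.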

In writing an estimate $\hat{Q}(A)$ of (19) on the basis of training samples,
we first use the samples in $\Lambda_C$ to obtain estimates
$\hat{f}_Y(y \mid H^j, A)$ of the functions $f_Y(y \mid H^j, A)$ by means of
the Parzen kernel [10], [11]. This in turn leads to estimates
$\hat{g}_{ij}(y, A)$ of $g_{ij}(y, A)$ via equation (17). We use the training
samples in $\Lambda_A$ to obtain the estimates $\hat{F}_j(\cdot \mid A)$ of
the distributions $F_j(\cdot \mid A)$ defined by equations (18-1) through
(18-M) and hence the estimates $\hat{F}_j(0, \ldots, 0 \mid A)$ appearing in
(19).
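
The two-stage estimate just described can be sketched for the two-class case with a one-dimensional feature. The Gaussian kernel, the fixed window width h, and the sample values are assumptions for illustration; the samples are taken to be already mapped into the feature space by A, and the variable names are hypothetical.

```python
import math

def parzen_density(samples, h):
    """Parzen estimate with a Gaussian kernel of width h (1-D feature)."""
    n = len(samples)
    norm = n * h * math.sqrt(2.0 * math.pi)
    def f_hat(y):
        return sum(math.exp(-0.5 * ((y - s) / h) ** 2) for s in samples) / norm
    return f_hat

def estimate_error(lam_C, lam_A, priors, h):
    """Two-class estimate of (19): P_1 * F1_hat(0) + P_2 * F2_hat(0)."""
    # Stage 1: density estimates from the classifier-design samples.
    f1 = parzen_density(lam_C[1], h)
    f2 = parzen_density(lam_C[2], h)
    def g12_hat(y):
        # Estimated discriminant, following (17).
        return priors[1] * f1(y) - priors[2] * f2(y)
    # Stage 2: empirical distributions at zero from the remaining samples.
    F1 = sum(g12_hat(y) < 0 for y in lam_A[1]) / len(lam_A[1])
    F2 = sum(-g12_hat(y) < 0 for y in lam_A[2]) / len(lam_A[2])
    return priors[1] * F1 + priors[2] * F2

# Hypothetical feature-space samples for two classes.
feat_C = {1: [-2.0, -1.0, -1.5], 2: [1.0, 2.0, 1.5]}
feat_A = {1: [-1.2, -0.8, 0.4, -2.0], 2: [1.1, 0.9, 1.7, -0.5]}
q_hat = estimate_error(feat_C, feat_A, priors={1: 0.5, 2: 0.5}, h=0.5)
print(q_hat)
```

Using disjoint sample sets for the two stages follows the partition of the training set into $\Lambda_C$ and $\Lambda_A$ described in Section 2.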

By means of the Kiefer-Wolfowitz [12] generalization of the Kolmogorov [13]
and Smirnov [14] theory, one can derive bounds on the error in the estimate
$\hat{Q}(A)$ of the probability of error Q(A). Details of this study will
appear elsewhere [9].

*$y \in H^j \iff$ y arises from a pattern belonging to $H^j$.

**Strictly speaking, some of the arguments of $F_j$, $j \ne 1$, in (19) should
be written $0^-$ instead of 0, to show that we are referring to left-hand
limits corresponding to the strict inequalities in equations (18).

3.4. Optimal Dimensionality of the Feature Space

The optimal dimension m of the feature space $\Phi$ is intimately
related to the size of the training set $\Lambda$. This is because, for sets
$\Lambda_A$ and $\Lambda_C$ of fixed sizes, the errors in the estimates
$\hat{g}_{ij}$ and $\hat{F}_j$ of the functions $g_{ij}$ and $F_j$ decrease as
m decreases (that is, the lower the m, the closer $\hat{g}_{ij}$ and
$\hat{F}_j$ are to $g_{ij}$ and $F_j$); on the other hand, the probability of
error increases with decreasing m, because of the loss of information by
reduction of dimensionality. This indicates that the optimal dimension
m should be the one that corresponds to the best compromise between the
aforementioned two competing effects.


While some papers have appeared previously [15] [16] [17] on the
dimensionality-versus-sample-size problem, we have studied this problem
in the context of the feature extraction problem using the developments
mentioned in the preceding section, and these results appear in
[9] and [18].


3.5. The Mathematical Programming Approach

From all the preceding considerations it is clear that the problem
of optimal feature extractor design stated at the end of Section 2 is
a well-defined nonlinear programming problem with the criterion functional
given by (10) or (19) to be minimized, and the constraints specified by
the properties of the given pattern structure to be recognized, and by
the class $\mathcal{X}$ over which the optimization is to be carried out.

Typically, in any given application, a complete study and implementation
of this approach would require: (a) the study of conditions for the
existence and uniqueness of the optimal $A^*$; (b) development of efficient
convergent algorithms for the determination of $A^*$; (c) programming and
testing of these algorithms on a computer with simulated and real data
bases.

All these phases have been carried out by the author and his associates
for Gaussian pattern structures and for the class of linear feature
extractors in [3] [18] [19] [20]. Also, phases (a) and (b) of the study
of the general non-Gaussian nonlinear case have nearly been completed
by A. Sagar and the author, and these results will appear in the future.

4. Conclusion

A mathematical programming approach has been described for the
design of a processor for feature extraction in pattern recognition.

The main consideration in the development of the design algorithm
is the optimization of the recognition capability of the system, taking
into account the realistic constraints appearing in a particular
application.

